Let's scrape some death row data

Texas executes a lot of criminals, and it has a web page that keeps track of people on its death row.

Using what you've learned so far, let's scrape this table into a CSV. Then we're going to write a function to grab a couple of pieces of additional data from the inmates' detail pages.

Import our libraries


In [ ]:
import csv   # for writing the scraped data to file
import time  # for pausing between requests

import requests                # for fetching the pages
from bs4 import BeautifulSoup  # for parsing the HTML

Fetch and parse the summary page


In [ ]:
# the URL to request
URL = 'https://www.tdcj.state.tx.us/death_row/dr_offenders_on_dr.html'

# get that page
page = requests.get(URL)

# turn the page text into soup
soup = BeautifulSoup(page.text, 'html.parser')

# find the table of interest
table = soup.find('table')
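
One optional safeguard before souping the page: have requests raise an exception if the response came back as an error (4xx/5xx), so we don't end up parsing an error page by mistake. A quick sketch:


In [ ]:
# optional: halt with an HTTPError if the request failed,
# rather than quietly souping an error page
page.raise_for_status()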

Loop over the table rows and write to CSV


In [ ]:
# find all of the table rows, skipping the header row
rows = table.find_all('tr')[1:]

# open a file to write to
# (newline='' keeps the csv module from adding blank lines on Windows)
with open('death-row.csv', 'w', newline='') as outfile:
    
    # create a writer object
    writer = csv.DictWriter(outfile, fieldnames=['id', 'link', 'last', 'first', 'dob', 'sex',
                                                 'race', 'date_received', 'county', 'offense_date'])
    
    # write header row
    writer.writeheader()

    # loop over the rows
    for row in rows:
        
        # extract the cells
        cells = row.find_all('td')
        
        # offense ID
        off_id = cells[0].string
        
        # link to detail page
        link = 'https://www.tdcj.state.tx.us/death_row/' + cells[1].a['href']
        
        # last name
        last = cells[2].string
        
        # first name
        first = cells[3].string
        
        # dob
        dob = cells[4].string
        
        # sex
        sex = cells[5].string
        
        # race
        race = cells[6].string
        
        # date received
        date_received = cells[7].string
        
        # county
        county = cells[8].string
        
        # offense date
        offense_date = cells[9].string
        
        # write out to file
        writer.writerow({
            'id': off_id,
            'link': link,
            'last': last,
            'first': first,
            'dob': dob,
            'sex': sex,
            'race': race,
            'date_received': date_received,
            'county': county,
            'offense_date': offense_date
        })
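
If you want to spot-check the file we just wrote, one option (purely a sanity check) is to read back the first few rows with csv.DictReader:


In [ ]:
# sanity check: print the first three rows of the file we just wrote
with open('death-row.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        print(row['id'], row['last'], row['first'])
        if i == 2:
            break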

Let's write a parsing function

We need a function that will take a URL of a detail page and do these things:

  • Open the detail page URL using requests
  • Parse the contents using BeautifulSoup
  • Isolate the bits of information we're interested in: height, weight, eye color, hair color, native county, native state, link to mugshot
  • Return those bits of information in a dictionary

A couple of things to keep in mind: Not every inmate will have every piece of data. Also, not every inmate has an HTML detail page to parse -- for the older cases, the link goes to a JPEG instead. So we'll need to work around those limitations.

We shall call our function fetch_details().


In [ ]:
def fetch_details(url):
    """Fetch details from a death row inmate's page."""

    # create a dictionary with some default values
    # as we go through, we're going to add stuff to it
    # (if you want to explore further, there is actually
    # a special kind of dictionary called a "defaultdict" to
    # handle this use case) =>
    # https://docs.python.org/3/library/collections.html#collections.defaultdict

    out_dict = {
        'Height': None,
        'Weight': None,
        'Eye Color': None,
        'Hair Color': None,
        'Native County': None,
        'Native State': None,
        'mug': None
    }
    
    # partway down the page, the links go to JPEGs instead of HTML pages
    # we can't parse images, so we'll just return the dictionary of default values
    if not url.endswith('.html'):
        return out_dict
    
    # get the page
    r = requests.get(url)
    
    # soup the HTML
    soup = BeautifulSoup(r.text, 'html.parser')

    # find the table of info
    table = soup.find('table', {'class': 'tabledata_deathrow_table'})
    
    # target the mugshot, if it exists
    mug = table.find('img', {'class': 'photo_border_black_right'})
    
    # if there is a mug, grab the src and add it to the dictionary
    if mug:
        out_dict['mug'] = 'http://www.tdcj.state.tx.us/death_row/dr_info/' + mug['src']

        
    # get a list of the "label" cells
    # on some pages, they're identified by the class 'tabledata_bold_align_right_deathrow'
    # on others, they're identified by the class 'tabledata_bold_align_right_unit'
    # so we pass it a list of possible classes
    label_cells = table.find_all('td', {'class': ['tabledata_bold_align_right_deathrow',
                                                  'tabledata_bold_align_right_unit']})

    # gonna do some fanciness here in the interests of DRY =>
    # a list of attributes we're interested in -- should match exactly the text inside the cells of interest
    attr_list = ['Height', 'Weight', 'Eye Color', 'Hair Color', 'Native County', 'Native State']

    # loop over the list of label cells that we targeted earlier
    for cell in label_cells:
        
        clean_label_cell_text = cell.text.strip()
        
        # check to see if the cell text is in our list of attributes
        if clean_label_cell_text in attr_list:
            
            # if so, find the value -- go up to the tr and search for the other td --
            # and add that attribute to our dictionary
            value_cell_text = cell.parent.find('td', {'class': 'tabledata_align_left_deathrow'}).text.strip()
            
            out_dict[clean_label_cell_text] = value_cell_text

    # return the dictionary to the script
    return out_dict
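
As the comment at the top of the function mentions, Python's collections.defaultdict is another way to handle missing keys without spelling out every default. A minimal sketch of how that could look (the mugshot URL here is just a made-up placeholder):


In [ ]:
from collections import defaultdict

# a defaultdict that hands back None for any key we haven't set
out_dict = defaultdict(lambda: None)

print(out_dict['Height'])  # None, even though we never set it

# assignment works just like a regular dictionary
out_dict['mug'] = 'https://example.com/mugshot.jpg'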

Putting it all together

Now that we have our parsing function, we can:

  • Open and read the CSV file of summary inmate info (the one we just scraped)
  • Open and write a new CSV file of detailed inmate info

As we loop over the summary inmate data, we're going to call our new parsing function on the detail URL in each row. Then we'll combine the dictionaries (data from the row of summary data + new detailed data) and write out to the new file.
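
If the double-star unpacking syntax is new to you, here's a quick illustration with made-up data -- unpacking two dictionaries inside a new pair of braces merges them into one:


In [ ]:
summary = {'first': 'Pat', 'last': 'Example'}
details = {'Height': '5-10', 'Weight': '170'}

# merge the two dictionaries into a new one;
# if a key appeared in both, the right-hand value would win
combined = {**summary, **details}

print(combined)
# {'first': 'Pat', 'last': 'Example', 'Height': '5-10', 'Weight': '170'}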


In [ ]:
# open the CSV file to read from and the one to write to
with open('death-row.csv', 'r', newline='') as infile, open('death-row-details.csv', 'w', newline='') as outfile:
    
    # create a reader object
    reader = csv.DictReader(infile)
    
    # the output headers are going to be the headers from the summary file
    # plus a list of new attributes
    headers = reader.fieldnames + ['Height', 'Weight', 'Eye Color', 'Hair Color',
                                   'Native County', 'Native State', 'mug']

    # create the writer object
    writer = csv.DictWriter(outfile, fieldnames=headers)
    
    # write the header row
    writer.writeheader()
    
    # loop over the rows in the input file
    for row in reader:
        
        # print the inmate's name (so we can keep track of where we're at)
        # helps with debugging, too
        print(row['first'], row['last'])
        
        # call our function on the URL in the row
        deets = fetch_details(row['link'])        
        
        # add the two dicts together by
        # unpacking them inside a new one
        # and write out to file
        writer.writerow({**row, **deets})
        
        # pause for a couple of seconds between requests
        # to be polite to the server we're scraping
        time.sleep(2)
    
    print('---')
    print('Done!')